Discovery Processes in Discovery Net
Abstract
The activity of e-Science involves making discoveries by analyzing data to find new knowledge. Discoveries of value cannot be made by simply performing a pre-defined set of steps to produce a result. Rather, there is an original, creative aspect to the activity which by its nature cannot be automated. In addition to finding new knowledge, discovery therefore also concerns finding a process to find new knowledge. How discovery processes are modeled is therefore key to effectively practicing e-Science. We argue that since a discovery process instance serves a similar purpose to a mathematical proof, it should have similar properties: results can be deterministically reproduced when it is re-executed, and intermediate results can be viewed to aid examination and comprehension.

1. Analysis for Discovery

The massive increase in data generated by new high-throughput methods for data collection in the life sciences has driven interest in the emerging area of e-Science. Practitioners in this field are scientific ‘knowledge workers’ performing analysis in silico. The traditional methodology of formulating a hypothesis, designing an experiment to test it, running the experiment, then studying and evaluating the results is now largely computerised. Experimental data in e-Science is either captured electronically or generated by a software simulation. For example, in gene expression analysis the activity of the whole genome may be recorded using microarray technology.

e-Science concerns the development of new practices and methods to find knowledge. Nontrivial, actionable knowledge cannot be batch generated by a set of predefined methods; rather, the creativity and expertise of the scientist are necessary to formulate new approaches. Whilst the dynamic nature of massively distributed service-oriented architectures promises to give scientists powerful tools, it also raises many issues of complexity.
New resources such as online data sources, algorithms and methods defined as processes are becoming available daily. A single process may need to integrate techniques from a range of disciplines such as data mining, text mining, image mining, bioinformatics and cheminformatics, and be created by a multidisciplinary team of experts. A major challenge is to coordinate these resources effectively in a discovery environment in order to create knowledge. The purpose of a discovery environment is to allow users to perform their analyses, computing the required results and creating a record of how this was accomplished. The most widespread and promising approach is to use process-based systems with re-usable components that allow the end user to compose new methods of analysis through a visual programming interface.

Workflow systems and analysis software are already well established in the sphere of business, but the requirements for scientific disciplines are markedly different owing to the nature of discovery. Two major uses of process-based systems in business are production workflows and business intelligence. Production workflows[1], which grew from document workflow systems, are primarily concerned with tying together production systems such as operational databases so that an organisation can streamline its regular, day-to-day activities. Such workflows are relatively fixed and are authored by software engineers based on models of accepted practice, to the extent that systems in certain domains[2] come predefined with processes and databases. It is not uncommon for businesses to change their actual practices to fit the predefined processes supplied with such a system. In production workflow systems the emphasis is on repeatedly executing a fixed set of processes and their constituent activities. Business intelligence concerns volumes of data that are too large to be handled manually, thus requiring computer-based approaches.
Knowledge Discovery in Databases (KDD)[3] is a field that concerns the use of techniques from machine learning, statistics and visualisation to enable users to find useful, actionable patterns in large quantities of data. It is a multi-disciplinary collaborative activity, with data, algorithm and domain experts working side by side on a project. Knowledge discovery has been important in business for many years, and the techniques used for analysis have relevance in e-Science, but the nature of processes and how they are constructed is markedly different.

In e-Science the dynamic nature of the processes used, and of the operations they are composed of, is of central importance. A primary function of practitioners is to create new processes, which may utilise new resources as they become available on a daily basis. Each process instance is an experiment in itself, described in terms of specific data, whose aim is to determine the validity of the approach. Processes must be rapidly prototyped with a tight edit-execute-edit cycle. The need for such dynamism means that creative scientific work requires a new kind of environment and process representation to best support the user.

2. Provenance and Audit of Discovery

Provenance is a loaded word with many possible definitions and interpretations. In the field of antiques it simply means how authenticity is proven. In e-Science provenance clearly concerns a record of the origins of data for such purposes as peer review and reproduction of results. A static record of the events that occurred is readily made, but creating a dynamically re-executable artefact is a difficult problem that relies on properties of the environment in which the result is calculated. An e-Science process will typically make use of shared, distributed resources for execution and as sources of data.
Within an organisation, Laboratory Information Management Systems (LIMS), for example, can bridge the gap between the physical world and the computerised environment and provide a robust cornerstone for all data exploration by recording how data was captured. In contrast, for public resources there is scant support for versioning, and the worst case (the basic day-to-day availability of resources) must be carefully considered by organisations that rely on such services to produce their intellectual property.

Audit is another loaded word, with similar scope for interpretation. In the simplest sense it is very similar to provenance: a record of the method by which a result was produced. However, a fully detailed record would also include how the methodology evolved to reach its final configuration and end result. Here we examine the following issues of provenance and audit that need addressing for discovery:

• Data archiving
• Operation versioning
• Provenance of a computed result
• Provenance of a discovery process

Data archiving concerns the data sources used within the scope of a discovery process instance. Operation versioning concerns managing changes to executable resources, such as analysis algorithms, over time. Provenance of a computed result concerns derived data or results that have been locally computed from a discovery process. Provenance of a discovery process concerns how a discovery process was authored.
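As a rough illustration of the first three issues, the sketch below records a fingerprint of the archived input data, the name and version of each operation, and every intermediate result, then checks that re-executing the process reproduces the same record. It is a minimal sketch and not the Discovery Net implementation: the names `Operation`, `ProcessRun`, `execute`, `reproduces`, `normalise` and `threshold` are all hypothetical, data is assumed to be JSON-serialisable, and operations are assumed to be deterministic. Provenance of the discovery process itself (how it was authored) is not covered.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


def digest(value: Any) -> str:
    """SHA-256 fingerprint of a JSON-serialisable value (assumed data model)."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


@dataclass(frozen=True)
class Operation:
    """A named, versioned processing step (operation versioning)."""
    name: str
    version: str
    func: Callable[[Any], Any]


@dataclass
class ProcessRun:
    """Record of one execution of a process instance."""
    input_digest: str  # fingerprint of the archived input data
    steps: List[Dict[str, Any]] = field(default_factory=list)


def execute(operations: List[Operation], data: Any) -> ProcessRun:
    """Run the operations in order, keeping every intermediate result."""
    run = ProcessRun(input_digest=digest(data))
    for op in operations:
        data = op.func(data)
        run.steps.append({
            "operation": op.name,
            "version": op.version,
            "output": data,                 # viewable intermediate result
            "output_digest": digest(data),  # compact form for comparison
        })
    return run


def signature(run: ProcessRun) -> List[Tuple[str, str, str]]:
    """The parts of a run that must match for it to count as reproduced."""
    return [(s["operation"], s["version"], s["output_digest"]) for s in run.steps]


def reproduces(operations: List[Operation], archived_data: Any,
               recorded: ProcessRun) -> bool:
    """Re-execute the process and compare it with the recorded run."""
    replay = execute(operations, archived_data)
    return (replay.input_digest == recorded.input_digest
            and signature(replay) == signature(recorded))


# Hypothetical operations, for illustration only.
def normalise(values: List[float]) -> List[float]:
    peak = max(values)
    return [v / peak for v in values]


def threshold(values: List[float]) -> List[float]:
    return [v for v in values if v >= 0.5]


PIPELINE = [
    Operation("normalise", "1.0", normalise),
    Operation("threshold", "1.0", threshold),
]

if __name__ == "__main__":
    archived = [2.0, 4.0, 8.0]
    run = execute(PIPELINE, archived)
    for step in run.steps:
        print(step["operation"], step["version"], step["output"])
    print("reproducible:", reproduces(PIPELINE, archived, run))
```

In this sketch a change to the archived data or to an operation's version makes `reproduces` return false, which is the kind of check a re-executable record needs; a real environment would also have to account for remote services whose behaviour can change without a version change.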
Publication date: 2004